
    Genome classification by gene distribution: An overlapping subspace clustering approach

    Background: Genomes of lower organisms have been observed with a large amount of horizontal gene transfers, which cause difficulties in their evolutionary study. Bacteriophage genomes are a typical example. One recent approach that addresses this problem is the unsupervised clustering of genomes based on gene order and genome position, which helps to reveal species relationships that may not be apparent from traditional phylogenetic methods. Results: We propose the use of an overlapping subspace clustering algorithm for such genome classification problems. The advantage of subspace clustering over traditional clustering is that it can associate clusters with gene arrangement patterns, preserving genomic information in the clusters produced. Additionally, overlapping capability is desirable for the discovery of multiple conserved patterns within a single genome, such as those acquired from different species via horizontal gene transfers. The proposed method involves a novel strategy to vectorize genomes based on their gene distribution. A number of existing subspace clustering and biclustering algorithms were evaluated to identify the best framework upon which to develop our algorithm; we extended a generic subspace clustering algorithm called HARP to incorporate overlapping capability. The proposed algorithm was assessed and applied on bacteriophage genomes. The phage grouping results are consistent overall with the Phage Proteomic Tree and showed common genomic characteristics among the TP901-like, Sfi21-like and sk1-like phage groups. Among 441 phage genomes, we identified four significantly conserved distribution patterns structured by the terminase, portal, integrase, holin and lysin genes. We also observed a subgroup of Sfi21-like phages comprising a distinctive divergent genome organization and identified nine new phage members to the Sfi21-like genus: Staphylococcus 71, phiPVL108, Listeria A118, 2389, Lactobacillus phi AT3, A2, Clostridium phi3626, Geobacillus GBSV1, and Listeria monocytogenes PSA. Conclusion: The method described in this paper can assist evolutionary study through objectively classifying genomes based on their resemblance in gene order, gene content and gene positions. The method is suitable for application to genomes with high genetic exchange and various conserved gene arrangement, as demonstrated through our application on phages.

    Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition

    An approach to infer the unknown microbial population structure within a metagenome is to cluster nucleotide sequences based on common patterns in base composition, otherwise referred to as binning. When functional roles are assigned to the identified populations, a deeper understanding of microbial communities can be attained, more so than gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers, which uses a novel transformation of the oligonucleotide frequency-derived error gradient and GC content to generate coarse groups at the first tier of clustering; and tetranucleotide frequency to refine these groups at the secondary clustering tier. The proposed method has a demonstrated improvement over PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks that were used for evaluation in this study. The proposed method is then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, with the inferred population structure validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the proposed method was further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample, which was previously assumed to be too complex for binning prior to functional analysis

    Gene functionality's influence on the second codon: A large-scale survey of second codon composition in three domains

    The second codon of a transcript, besides encoding for an amino acid, is now known to also have multiple molecular functions and is involved in translation efficiency and protein turn-over and maturation processing. These multiple purposes therefore make the selection constraints on this codon's composition more complex. To examine the biological significance of various permutations of the second codon, we conducted a systematic survey of second codon composition from 442 selected genomes across three domains. The amino acid bias of the second codon is associated with specific protein functions. The most common amino acids (S, A, K and T) are significantly avoided in Cell Envelope-related genes but preferred in Translation or Energy Metabolism-related genes, suggesting that the function of a gene product is a significant factor influencing the composition of the second codon.
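
The kind of survey described in this abstract can be illustrated with a minimal sketch that tallies the amino acid encoded by the second codon of each coding sequence. This is not the authors' pipeline: the function name `second_codon_counts` and the toy sequences are hypothetical, and the standard genetic code (NCBI translation table 1) is assumed for every genome, which will not hold for organisms that use alternative tables.

```python
from collections import Counter

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
# Assumption: every input sequence uses this table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def second_codon_counts(coding_sequences):
    """Count the amino acids encoded by the second codon of each sequence.

    Sequences shorter than two codons, or whose second codon contains
    characters other than A, C, G, T, are skipped.
    """
    counts = Counter()
    for cds in coding_sequences:
        codon = cds[3:6].upper()  # nucleotides 4-6 form the second codon
        amino_acid = CODON_TABLE.get(codon)
        if amino_acid is not None:
            counts[amino_acid] += 1
    return counts


# Toy input: three made-up coding sequences starting with ATG.
example = ["ATGAGCAAA", "ATGGCTTTT", "ATGAGC"]
print(second_codon_counts(example))
```

Grouping such counts by annotated gene function (for example, Cell Envelope versus Translation) would be the next step toward the comparison the abstract reports; that grouping is left out here because the annotation source is not given.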

    Accurate reconstruction of viral quasispecies spectra through improved estimation of strain richness

    Background: Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation ride on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations where the strains are highly genetically related. The lack of knowledge on the number of different strains in a quasispecies population is observed to hinder the precision of existing Viral Quasispecies Spectrum Reconstruction (QSR) methods due to the uncontrolled reconstruction of a large number of in silico false positives. In this work, we formulated a novel probabilistic method for strain richness estimation specifically targeting viral quasispecies. By using this approach we improved our recently proposed spectrum reconstruction pipeline ViQuaS to achieve higher levels of precision in reconstructed quasispecies spectra without compromising the recall rates. We also discuss how one other existing popular QSR method named ShoRAH can be improved using this new approach. Results: On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2%-13% and F-score by 1%-9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0%-7% and 0%-5% respectively. Conclusions: The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors.

    Gene function prediction based on genomic context clustering and discriminative learning: an application to bacteriophages

    BACKGROUND: Existing methods for whole-genome comparisons require prior knowledge of related species and provide little automation in the function prediction process. Bacteriophage genomes are an example that cannot be easily analyzed by these methods. This work addresses these shortcomings and aims to provide an automated prediction system of gene function. RESULTS: We have developed a novel system called SynFPS to perform gene function prediction over completed genomes. The prediction system is initialized by clustering a large collection of weakly related genomes into groups based on their resemblance in gene distribution. From each individual group, data are then extracted and used to train a Support Vector Machine that makes gene function predictions. Experiments were conducted with 9 different gene functions over 296 bacteriophage genomes. Cross validation results gave an average prediction accuracy of ~80%, which is comparable to other genomic-context based prediction methods. Functional predictions are also made on 3 uncharacterized genes and 12 genes that cannot be identified by sequence alignment. The software is publicly available at http://www.synteny.net/. CONCLUSION: The proposed system employs genomic context to predict gene function and detect gene correspondence in whole-genome comparisons. Although our experimental focus is on bacteriophages, the method may be extended to other microbial genomes as they share a number of similar characteristics with phage genomes such as gene order conservation

    Differential expression of lipoprotein genes in Mycoplasma pneumoniae after contact with human lung epithelial cells, and under oxidative and acidic stress

    Background: Mycoplasma pneumoniae is a human pathogen that is a common cause of community-acquired pneumonia. It harbours a large number of lipoprotein genes, most of which are of unknown function. Because of their location on the cell surface, these proteins are likely to be involved in the bacterial response to environmental changes, or in the initial stages of infection. The aim of this study was to determine if genes encoding surface lipoproteins are differentially expressed after contact with a human cell line, or after exposure to oxidative or acidic stress. Results: Using qRT-PCR assays, we observed that the expression of a number of lipoprotein genes was up-regulated when M. pneumoniae was placed in contact with human cells. In contrast, lipoprotein expression was generally down-regulated or unchanged when exposed to either hydrogen peroxide or low pH (5.5). When exposed to low pH, the mRNA levels of four polycistronically transcribed genes in Lipoprotein Multigene Family 6 formed a gradient of decreasing quantity with increasing distance from a predicted promoter. Conclusion: The demonstrated transcriptional changes provide evidence for the functionality of these mostly unassigned genes and indicate that they are regulated in response to changes in environmental conditions. In addition we have shown that the members of Lipoprotein Gene Family 6 may be expressed polycistronically.

    Mucus Sugar Content Shapes the Bacterial Community Structure in Thermally Stressed Acropora muricata

    It has been proposed that the chemical composition of a coral’s mucus can influence the associated bacterial community. However, information on this topic is rare, and non-existent for corals that are under thermal stress. This study therefore compared the carbohydrate composition of mucus in the coral Acropora muricata when subjected to increasing thermal stress from 26°C to 31°C, and determined whether this composition correlated with any changes in the bacterial community. Results showed that, at lower temperatures, the main components of mucus were N-acetyl glucosamine and C6 sugars, but these constituted a significantly lower proportion of the mucus in thermally-stressed corals. The change in the mucus composition coincided with a shift from a γ-Proteobacteria- to a Verrucomicrobiae- and α-Proteobacteria-dominated community in the coral mucus. Bacteria in the class Cyanobacteria also started to become prominent in the mucus when the coral was thermally stressed. The increase in the relative abundance of the Verrucomicrobiae at higher temperature was strongly associated with a change in the proportion of fucose, glucose and mannose in the mucus. The increase in the relative abundance of α-Proteobacteria was associated with GalNAc and glucose, while the drop in relative abundance of γ-Proteobacteria at high temperature coincided with changes in fucose and mannose. Cyanobacteria were highly associated with arabinose and xylose. Changes in mucus composition and the bacterial community in the mucus layer occurred at 29°C, prior to visual signs of coral bleaching at 31°C. A compositional change in the coral mucus induced by thermal stress could therefore be a key factor leading to a shift in the associated bacterial community. This, in turn, has the potential to impact the physiological function of the coral holobiont.

    Using Growing Self-Organising Maps to Improve the Binning Process in Environmental Whole-Genome Shotgun Sequencing

    Metagenomic projects using whole-genome shotgun (WGS) sequencing produce many unassembled DNA sequences and small contigs. The step of clustering these sequences, based on biological and molecular features, is called binning. A reported strategy for binning that combines oligonucleotide frequency and self-organising maps (SOM) shows high potential. We improve this strategy by identifying suitable training features, implementing a better clustering algorithm, and defining quantitative measures for assessing results. We investigated the suitability of each of di-, tri-, tetra-, and pentanucleotide frequencies. The results show that dinucleotide frequency is not a sufficiently strong signature for binning 10 kb long DNA sequences, compared to the other three. Furthermore, we observed that increased order of oligonucleotide frequency may deteriorate the assignment result in some cases, which indicates the possible existence of optimal species-specific oligonucleotide frequency. We replaced SOM with a growing self-organising map (GSOM), which obtained comparable results while gaining a 7%–15% speed improvement.
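
To make the oligonucleotide-frequency features mentioned in this abstract concrete, here is a minimal sketch of how a DNA fragment could be turned into a k-mer frequency vector before it is handed to a clustering method such as a SOM. It is an illustration under stated assumptions, not the authors' feature extraction: the function name `kmer_frequencies` is hypothetical, windows containing characters other than A, C, G, T are simply ignored, and no merging of reverse-complement k-mers is performed.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Return the normalised k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    Overlapping windows are counted; windows with ambiguous bases
    (anything other than A, C, G, T) are left out of the total.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# k=2 gives dinucleotide frequencies (16 features), k=4 tetranucleotide
# frequencies (256 features), and so on.
vector = kmer_frequencies("ACGTACGTTTGACCA", k=2)
print(len(vector))
```

The vector length grows as 4^k, so pentanucleotide frequencies already give 1,024 features per fragment; this is one practical reason the choice of k matters for short fragments.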

    Binning sequences using very sparse labels within a metagenome

    Background: In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most of the current methods of binning, such as BLAST, k-mer and PhyloPythia, involve assigning sequence fragments by comparing sequence similarity or sequence composition with already-sequenced genomes that are still far from comprehensive. We propose a semi-supervised seeding method for binning that does not depend on knowledge of completed genomes. Instead, it extracts the flanking sequences of highly conserved 16S rRNA from the metagenome and uses them as seeds (labels) to assign other reads based on their compositional similarity. Results: The proposed seeding method is implemented on an unsupervised Growing Self-Organising Map (GSOM), and called Seeded GSOM (S-GSOM). We compared it with four well-known semi-supervised learning methods in a preliminary test, separating random-length prokaryotic sequence fragments sampled from the NCBI genome database. We identified the flanking sequences of the highly conserved 16S rRNA as suitable seeds that could be used to group the sequence fragments according to their species. S-GSOM showed superior performance compared to the semi-supervised methods tested. Additionally, S-GSOM may also be used to visually identify some species that do not have seeds. The proposed method was then applied to simulated metagenomic datasets using two different confidence threshold settings and compared with PhyloPythia, k-mer and BLAST. At the reference taxonomic level Order, S-GSOM outperformed all k-mer and BLAST results and showed comparable results with PhyloPythia for each of the corresponding confidence settings, where S-GSOM performed better than PhyloPythia in the ≥ 10 reads datasets and comparable in the ≥ 8 kb benchmark tests. Conclusion: In the task of binning using semi-supervised learning methods, results indicate S-GSOM to be the best of the methods tested. Most importantly, the proposed method does not require knowledge from known genomes and uses only very few labels (one per species is sufficient in most cases), which are extracted from the metagenome itself. These advantages make it a very attractive binning method. S-GSOM outperformed the binning methods that depend on already-sequenced genomes, and compares well to the current most advanced binning method, PhyloPythia.